Method of managing web sites registered in search engine and a system thereof

ABSTRACT

Disclosed is a method and system for managing web sites registered in a search engine that provides information about web sites on the Internet, wherein information about the web sites registered in the search engine is analyzed to prevent the provision of search results different from essential contents contained in the web sites. In the method, information of the registered web site is received and recorded in a database after being classified by predetermined fields. A search robot is controlled to read a source file constituting a web page of the registered web site, and the read source file is then analyzed. It is determined based on a predetermined basis whether or not the registered web site is a deceptive site. Predetermined processing is performed on the registered web site if the web site is determined to be a deceptive site. The source file is preferably an HTML document.

TECHNICAL FIELD

The present invention relates to a search engine for providing information about web sites on the Internet, and more particularly to a method for managing web sites registered in a search engine, wherein information about the web sites registered in the search engine is analyzed to prevent the provision of search results different from essential contents contained in the web sites.

BACKGROUND ART

A conventional search engine, such as Altavista (http://www.altavista.com), Lycos (http://www.lycos.com) or Yahoo (http://www.yahoo.com), generally includes a database for classifying, storing and managing web site information based on a predetermined rule, a search robot, embodied as software, for constantly traveling over the web and automatically collecting new web site information, and search engine software for storing the collected data in a database and allowing a user of the search engine to search for desired information in the database.

FIG. 1 a is a block diagram showing an entire system for providing the search engine service. As shown in FIG. 1 a, a user connects to a search engine server 150 over the Internet via a user terminal 110. If the user enters search terms, the search engine server 150 queries search engine software 140 about web site information corresponding to the entered search terms, and the search engine software 140 searches a database 130 to notify the user of retrieved web site information. A search robot 120 is an entity embodied as software for constantly traveling over the web and automatically collecting new web site information from a web server 160, as described above. The search robot 120 searches for HTML (Hypertext Markup Language) documents on a network and parses links described in the HTML documents and then collects data from a number of web sites existing on the network. The data collected by the search robot 120 is databased. The term “databased” refers to a series of processes of performing morphological analysis of information located on a web site and producing a corresponding index table and storing it in the database 130. The database 130 is provided to store all web site information collected by the search robot 120. The search engine software 140 functions to show search results to users. This software searches a large number of pages stored in the database 130 and lists search results by relevance to the search term. The conventional search engine as described above registers information about a web site in a search engine and provides the information to users in the following ways.

(1) Information of a web site is collected using the search robot as described above, and the web site information is registered in the search engine after being reviewed by expert surfers.

(2) A category corresponding to the subject of a web site to be registered is selected from a directory of categories classified by subject, and it is requested that the web site be registered in the selected category, and then the web site is registered in the search engine after being reviewed by expert surfers. Some search engines provide a fee-based directory registration service to reduce the time required to register a web site in their directory with a registration fee.

Web sites registered in the search engine in the above method are provided to a user who is looking for desired information after they are searched for in various ways, such as integrated web search and directory search, based on search terms entered by the user. The integrated web search is also called “word-based search”, in which Uniform Resource Locators (URLs) of all web sites are stored in a database and desired information is searched for based on a specific keyword entered by the user. The directory search is also called “subject-based search”, in which web sites are organized into subject-based categories and if a user links to a desired category, the user can view detailed items thereof. In this manner, the subject-based search allows the user to continue to link to the detailed items and retrieve desired information. For example, if a user desires to find Korean team match scores in the 2002 Korea-Japan World Cup, the user can search for them via categories such as Sports→Ball Sports→Soccer→FIFA World Cup→2002 Korea-Japan World Cup→Korean team match scores. FIG. 1 b is an example screenshot of the directory search method. As shown in this figure, directory search results with search terms “world cup” are three categories “World Cup”, “2002 FIFA Korea-Japan World Cup” and “History of the World Cup”, and the user can search for desired information by moving to one of the three categories in which the desired information is most likely to be placed. A typical search engine based on the integrated web search method is Lycos (http://lycos.cs.cmu.edu) developed by Michael L. Mauldin at Carnegie Mellon University, and a typical search engine based on the directory search method is Yahoo (http://www.yahoo.com). Many current search engines provide hybrid search services based on a combination of the different search methods described above.

The conventional method for registering web sites in the search engine and searching for the registered web sites has the following problems.

As the number of Internet users has rapidly increased, the number of users who desire to search for specific information has rapidly increased and the number of types of information for which they desire to search has increased. As the number of such users and the types of such information have increased, some search terms appear very frequently; these will also be referred to as “popular keywords”. This causes a problem in that users, who desire to search for information based on the popular keywords, may receive information of web sites (hereinafter also referred to as “deceptive sites”) that contain contents of no use to the users and insert the popular keywords in their web pages in various ways. For example, if a user enters a popular keyword “Pikachu” to search for information about Pikachu, information of all registered web sites that contain the word “Pikachu” in their web pages is provided to the user. The web sites provided to the user may include web sites that contain adult or sexual contents and insert the word “Pikachu” in some places in their web pages in various ways (with ill intention in most cases). This popular keyword insertion causes a wide age range of users to be exposed to the information of the web sites that contain adult or sexual contents.

The conventional method for overcoming the problems described above requires complaint reports by users or requires specialists such as expert surfers to constantly monitor the registered web sites, but the conventional method obviously cannot be an ultimate solution to the problems. If an algorithm automatically executed on the Internet to solve the problems can be provided, it will be a useful means to solve the problems all at once.

DISCLOSURE OF THE INVENTION

Therefore, the present invention has been made in view of the above problems, and it is an object of the present invention to provide a method for managing web sites registered in a search engine, in which an algorithm is used to automatically detect deceptive sites, thereby allowing users of the search engine to correctly search for their desired information.

It is another object of the present invention to provide a method for managing web sites registered in a search engine, in which deceptive sites are automatically detected, and punitive measures are automatically imposed on operators of the detected deceptive sites, thereby reinforcing self-purification of the web sites registered in the search engine.

It is yet another object of the present invention to provide a method for managing web sites registered in a search engine, in which an algorithm is used to automatically detect deceptive sites and automatically take punitive measures such as warning against the detected sites, thereby saving a large amount of human resources that may otherwise have been wasted to detect the deceptive sites.

According to a preferred embodiment of the present invention, there is provided a method for managing web sites registered in a search engine, said method comprising the steps of: receiving web site information of the registered web site, classifying the web site information by predetermined fields, and recording the classified web site information in a database; reading a source file constituting a web page of the registered web site; analyzing the read source file; determining, based on a predetermined basis, whether or not the registered web site is a deceptive site; and performing a control operation to perform predetermined processing on the registered web site if the web site is determined to be a deceptive site, wherein the source file is an HTML (Hypertext Markup Language) document.

In addition, according to a preferred embodiment of the present invention, there is provided a system for managing a web site registered in a search engine, the system comprising: an interface module for performing data communication with at least one terminal; a web site registration module for receiving a web site registration request including web site information of a predetermined web site from said at least one terminal and classifying the web site information by predetermined fields; a database for classifying and storing a predetermined keyword corresponding to the web site and the web site information; a web site analysis module for extracting a source file constituting a web page of the web site, and analyzing the extracted source file; and a web site management module for determining, based on a predetermined basis, whether or not the web site is a deceptive site.

As described above, the term “deceptive site” used in the present specification refers to a web site that inserts predetermined keywords in a source file of its web page in various ways and contains contents entirely different from those to be searched for based on the predetermined keywords. According to an embodiment of the present invention, the predetermined keywords inserted in the source file of the web page may be popular keywords.

The term “popular keywords” refers to search words that appear very frequently, among search words entered by Internet users. The popular keywords may continually vary depending on the Internet users' tendency and social situations of the time. The popular keywords may include harmful keywords containing socially harmful content, and some examples thereof are “suicide”, “reject”, “gambling” and “conspiracy”.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects, features and other advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 a is a block diagram showing the configuration of a conventional system for providing web site search engine services;

FIG. 1 b is an example screenshot of a directory search method that is one of the web site search methods provided by search engines;

FIG. 2 is a block diagram showing the configuration of a system for managing web sites registered in a search engine according to a preferred embodiment of the present invention;

FIG. 3 is a flow chart showing a method for managing web sites registered in a search engine according to an embodiment of the present invention;

FIGS. 4 a to 4 k are various types of deceptive sites read by a search robot that travels over the web, in the method for managing web sites registered in the search engine according to a preferred embodiment of the present invention;

FIG. 5 is a flow chart showing a method for imposing a predetermined punitive measure on a registrant of a web site that is determined to be a deceptive site, in the method for managing the web sites registered in the search engine, according to a preferred embodiment of the present invention; and

FIG. 6 is a block diagram showing the internal configuration of a general computer system that can be used in managing web pages registered in the search engine according to the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

A method for managing web sites registered in a search engine according to preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

FIG. 2 is a block diagram showing the configuration of a system for managing web sites registered in a search engine according to an embodiment of the present invention. As shown in FIG. 2, the system according to the embodiment of the present invention includes an interface module 201, a web site registration module 202, a web site management module 203, a web site information database 204, a web site analysis module 205 and a search robot 207. According to the embodiment of the present invention, the system for managing web sites registered in the search engine may include a mail server 208 or an SMS server 209 for sending a predetermined message to a registrant of a registered web site. The mail server 208 and the SMS server 209 may be provided in a system for providing search engine services or may be located in a system operated by a third party. The interface module 201, other various modules, and the mail server 208 or the SMS server 209 are illustrated in FIG. 2 as separate entities. This illustration has been made only for easier explanation, and they may be the same entity. The elements shown in FIG. 2 may also be physically located at the same place, or alternatively they may be physically located apart from each other according to another embodiment of the present invention.

First, the interface module 201 functions to support data transmission between the search engine registration management system and a computer terminal provided to a registrant who desires to register a predetermined web site in the search engine, and also functions to interface between physical transmission equipment.

The web site registration module 202 functions to receive a request to register the predetermined web site from the registrant, and also to collect and classify information/data about the web site contained in the web site registration request. The web site registration module 202 may further include a billing module (not shown) for charging predetermined fees for the web site registration. The billing module may operate to charge different fees for a web site desired to be registered, depending on the type of the web site (i.e., depending on whether it is a general site containing general content or an adult site containing adult content).

The web site management module 203 is a module for overall registration management of web sites according to the present invention. Based on information of the web sites collected by the search robot 207, the web site management module 203 determines whether the web sites are in operation in conformity with a standard based on which their registration has been permitted. If it is determined that a web site is in inappropriate operation (i.e., it is a deceptive site), the web site management module 203 automatically takes a predetermined measure against a registrant of the web site. The web site management module 203 can interwork with the mail server 208 or the SMS server 209 to send an email to the registrant of the deceptive site or to send an SMS message to a mobile terminal of the registrant, thereby warning the registrant about the inappropriate operation of the deceptive site.

The web site information database 204 functions to classify and record information of the registered web sites. Various information, such as URLs, keywords, registrant information (registrant's name, address, email address, mobile terminal number, etc.), directory information, and the like of the web sites, may be classified by the information fields and stored in the web site information database 204.
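The classification by information fields described above can be pictured with a small record type. The following is only a minimal sketch: the class name `SiteRecord`, the function `classify_registration` and the form keys are hypothetical and do not appear in the specification, and keywords are assumed to arrive as one comma-separated string.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SiteRecord:
    """One registered web site, split into the fields named in the text."""
    url: str
    keywords: List[str] = field(default_factory=list)
    registrant_name: str = ""
    registrant_address: str = ""
    registrant_email: str = ""
    registrant_mobile: str = ""
    directory: str = ""


def classify_registration(form: Dict[str, str]) -> SiteRecord:
    """Sort a flat registration request into the record fields.

    The form keys ("url", "keywords", "name", ...) are assumptions
    made for this illustration only.
    """
    keywords = [k.strip() for k in form.get("keywords", "").split(",") if k.strip()]
    return SiteRecord(
        url=form["url"],
        keywords=keywords,
        registrant_name=form.get("name", ""),
        registrant_address=form.get("address", ""),
        registrant_email=form.get("email", ""),
        registrant_mobile=form.get("mobile", ""),
        directory=form.get("directory", ""),
    )
```

A record built this way could then be stored under its URL, so that later analysis results can be matched back to the registrant's contact fields.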

Information of a web site stored in the web site information database 204 may be modified by a registrant of the web site and by a system manager. When content of a web site is changed, the web site information database 204 may automatically update information of the web site stored therein, based on analysis results (for example, based on a new keyword corresponding to a URL of the web site) of data collected by the search robot 207 even though a registrant of the web site does not directly modify the stored information of the web site.

The web site analysis module 205 functions to analyze information of web sites collected by the search robot 207. The type of data collected by the search robot 207 and a method for analyzing the collected data will be described below in detail with reference to FIG. 3.

The above elements of the system for managing web sites registered in the search engine according to the embodiment of the present invention are divided simply according to their functions for easier explanation, and the functional division of the elements has nothing to do with actual physical locations thereof. It is obvious to those skilled in the art that the above modules may be embodied not only as hardware but also as software using a specific code.

FIG. 3 is a flow chart showing a method for managing web sites registered in a search engine according to a preferred embodiment of the present invention. The method for managing the web sites registered in the search engine according to the preferred embodiment of the present invention will now be described in detail with reference to FIG. 3 in conjunction with FIGS. 4 a to 4 k and FIG. 6.

The web site registration management method according to the preferred embodiment of the present invention is performed in the following manner, as shown in FIG. 3. A registrant, who desires to register a predetermined web site in the search engine, makes a request to register the web site with information of the web site (305). The information of the web site is classified by information fields (registrant's name, address, email address, mobile phone number, etc.) and recorded in a web site information database (310), and the web site is registered in the search engine (315). This registration step 315 may be performed in several ways. For example, in one way, a web site is registered in the search engine upon request of a manager of the web site as described above. In another way, a web site is registered in the search engine based on information of the web site obtained by the search robot that randomly travels over the web. In the former case, the registrant (i.e., the manager) of the web site can request that the web site be registered in a category closest to a subject (for example, “Pikachu” and “patent bar exam”) thereof decided by the registrant. After being reviewed by expert surfers, the requested web site can be registered in the search engine if it is determined that the requested web site satisfies predetermined requirements (for example, quality of the web site or noncommercial site requirements in case no registration fee is paid). The method for managing web sites registered in the search engine according to the present invention will be described, limited to the case where the web site is registered in the search engine upon request of the registrant of the web site. However, the method and system for managing web sites registered in the search engine according to the present invention can also be applied to other various ways in which the web site is registered in the search engine.

If the web site is registered, the search engine controls the search robot to read a source file constituting a web page of the registered web site and analyze the read source file (320).

According to the embodiment of the present invention, the source file analysis is based on HTML (Hypertext Markup Language) document analysis. In more detail, by analyzing tags in an HTML document of a web site, it can be determined whether the web site is a deceptive site that inserts popular keywords (i.e., high frequency search words) in an HTML document constituting its web site. As well known to those skilled in the art, the HTML document is composed of instructions called “tags”, and a web designer or the like, who produces web pages, composes a web site using the tags, and includes content, which is desired to be provided via the web site, in the web site.

FIGS. 4 a to 4 k are diagrams illustrating various embodiments of a method for analyzing an HTML document of a web site at step 320 of FIG. 3 to determine whether the web site is a deceptive site that includes inappropriate character strings in tags contained in its HTML document. These figures illustrate various ways to detect whether a web site is a deceptive site, based on analysis of HTML document tags of the web site. A detailed description will now be given of how the HTML document analysis is performed in the method for managing web sites registered in the search engine according to the present invention, with reference to FIGS. 4 a to 4 k.

(1) Deceptive Site Using String of the Same Color as Background Color

FIG. 4 a is an example deceptive site that contains character strings enclosed by tags, which are the same color as the background color of the deceptive site. In this figure, the left images are screenshots of web sites displayed to users, and the right images are HTML source files of the web sites displayed on the left side. As shown in FIG. 4 a, “#FFFFFF” is assigned to background color and “#FFFFFF” is also assigned to text color in the upper source file, so that text “Starcraft” and “Zolaman” are not viewed in the upper web site screen. In the same manner, “#FFFFFF” indicating white is assigned to background color and “white” is also assigned to text color in the lower source file of FIG. 4 a, so that text “Starcraft” and “Zolaman” are not viewed in the lower web site screen. As well known to those skilled in the art, the tag <body> shown in the source files of FIG. 4 a allows setting of various attributes of text or background displayed on a web page. Tags may be mainly classified into container tags composed of start and end tags (for example, <body></body> or <font></font> shown in FIG. 4 a) and standalone tags that do not require end tags. These tags may be used to compose a web site in various ways. Accordingly, if the background color of a web site is the same as the character string color thereof as described above, the web site can be displayed on a search results screen with the help of predetermined popular keywords even though it contains content unrelated to the popular keywords.
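The comparison of text color against background color described above might be sketched as follows. This is an illustrative sketch rather than the analysis module of the specification: it assumes colors are set through the `bgcolor` and `text` attributes of <body> and the `color` attribute of <font>, ignores style sheets, and knows only a tiny hypothetical table of color names. The names `HiddenColorTextFinder` and `find_same_color_text` are invented for the example.

```python
from html.parser import HTMLParser

# Small illustrative subset of color names; a real table would be larger.
NAMED_COLORS = {"white": "#ffffff", "black": "#000000"}


def normalize_color(value):
    """Lower-case a color value and map known names to hex form."""
    value = (value or "").strip().lower()
    return NAMED_COLORS.get(value, value)


class HiddenColorTextFinder(HTMLParser):
    """Collects text whose color equals the page background color."""

    def __init__(self):
        super().__init__()
        self.background = ""
        self.color_stack = []  # text color set by each open <body>/<font>
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.background = normalize_color(attrs.get("bgcolor"))
            self.color_stack.append(normalize_color(attrs.get("text")))
        elif tag == "font":
            self.color_stack.append(normalize_color(attrs.get("color")))

    def handle_endtag(self, tag):
        if tag in ("body", "font") and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.background:
            return
        # The innermost tag that actually sets a color decides the text color.
        current = next((c for c in reversed(self.color_stack) if c), None)
        if current == self.background:
            self.hidden_text.append(text)


def find_same_color_text(html):
    finder = HiddenColorTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_text
```

Normalizing both values before comparing them is what lets the sketch treat “#FFFFFF” and “white” as the same color, which is the case shown in the lower source file of FIG. 4 a.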

(2) Deceptive Site Using String Contained in Redirection Page

FIG. 4 b is a diagram showing an example deceptive site using character strings contained in a redirection page. In this figure, the left image is a screenshot of a web site displayed to users, and the right images are HTML source files of the web site displayed on the left side. As well known to those skilled in the art, the redirection setting allows movement from a connected web site to a new web site, and it can be embodied in source files as shown on the right side of FIG. 4 b. In FIG. 4 b, the upper source file uses an http-equiv attribute in the meta tag. The meta tag is generally used to set automatic redirection to a different web page within a predetermined time specified in a “content” item of FIG. 4 b. Typically, if the address of a home page is changed, the meta tag is used to automatically redirect a user connecting to an old address of the home page to a new address thereof within a predetermined time after displaying the address change information. The middle and lower source files in FIG. 4 b use “self.location” and “location.replace” script instructions, respectively, to redirect from the current web page to “http://www.naver.com”.

In the example deceptive site shown in FIG. 4 b that uses the redirection page, the upper source file including the meta tag inserts predetermined popular keywords “Starcraft” and “Zolaman” next to the redirection instruction, and the middle and lower source files insert the predetermined popular keywords “Starcraft” and “Zolaman” next to the tag </script>.

These redirection pages use the tags to instruct movement to different web sites and thus text added next to the tags plays no role. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if a redirection page contains character strings as described above, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.
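One way to picture the redirection check is to look for a redirection instruction and, if one is found, report any text that sits outside the script and title. The sketch below is an assumption-laden illustration: it recognizes only a meta tag with http-equiv set to “refresh” and scripts mentioning `self.location`, `location.replace` or `location.href` (the last is an addition not named in the text), and the names `RedirectTextFinder` and `text_on_redirect_page` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Script patterns treated as redirection; location.href is an extra assumption.
REDIRECT_SCRIPT = re.compile(r"self\.location|location\.replace|location\.href", re.I)
SKIPPED_TAGS = ("script", "style", "title")


class RedirectTextFinder(HTMLParser):
    """Notes whether a page redirects and collects its ordinary text."""

    def __init__(self):
        super().__init__()
        self.redirects = False
        self.text = []
        self._skip = None  # name of the script/style/title tag being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.redirects = True
        elif tag in SKIPPED_TAGS:
            self._skip = tag

    def handle_endtag(self, tag):
        if tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        if self._skip == "script":
            if REDIRECT_SCRIPT.search(data):
                self.redirects = True
        elif self._skip is None and data.strip():
            self.text.append(data.strip())


def text_on_redirect_page(html):
    """Return text found on a page that redirects, or [] if it does not."""
    finder = RedirectTextFinder()
    finder.feed(html)
    finder.close()
    return finder.text if finder.redirects else []
```

Because the visitor is moved away immediately, any non-empty result is text that serves the search robot rather than the user; how much text should count as deceptive is left open here.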

(3) Deceptive Site Using String in Title Tag

FIG. 4 c is a diagram showing an example deceptive site using character strings contained in a title tag. In this figure, the left images are screenshots of web sites displayed to users, and the right images are HTML source files of the web sites displayed on the left side. As well known to those skilled in the art, the title tag is used to briefly display the subject of a web site on the top of a web browser, and it can be embodied in source files as shown on the right side of FIG. 4 c. In FIG. 4 c, the upper source file with a title tag, among the source files shown on the right side, includes a plurality of popular keywords such as “Starcraft” and “Zolaman” in the title tag, whereby a web browser is displayed as shown on the left side of FIG. 4 c. On the other hand, the lower source file of FIG. 4 c uses a plurality of title tags, where a plurality of popular keywords such as “Starcraft” and “Zolaman” are contained in their start and end tags <title> and </title>.

The content in these title tags is not displayed on the web browser, however long the character strings it contains. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject due to the character strings contained in the title tag. Accordingly, as described above, if the length of a character string included in the title tag is more than a predetermined numerical value, or if the number of title tags is more than one, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.
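The two title conditions stated above (a title longer than a predetermined value, or more than one title tag) can be expressed in a few lines. This is a minimal sketch; the specification gives no numerical value, so the default of 60 characters is purely an assumption, and `TitleCollector` and `title_is_suspicious` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of every <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def title_is_suspicious(html, max_title_length=60):
    """True if there are several title tags or any title is too long.

    max_title_length stands in for the "predetermined numerical value";
    the default is an assumption, not a value from the specification.
    """
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    too_many = len(collector.titles) > 1
    too_long = any(len(t.strip()) > max_title_length for t in collector.titles)
    return too_many or too_long
```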

(4) Deceptive Site Using String Contained in Meta Tag

FIG. 4 d is a diagram showing an example deceptive site using character strings contained in a meta tag. In this figure, the left image is a screenshot of a web site displayed to users, and the right image is an HTML source file of the web site displayed on the left side.

As well known to those skilled in the art, the meta tag is used to represent general information about an HTML document, such as an author, date of creation and keywords thereof, which is not displayed on the body of a web page corresponding to the HTML document. Referring to the source file on the right side of FIG. 4 d, the meta tag contains “description” as document name and a plurality of popular keywords such as “Starcraft” and “Zolaman” as document content. The character strings, such as the popular keywords, contained in the meta tag are not displayed on the web page. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if a meta tag in a web site contains a character string, and the length of the character string is more than a predetermined numerical value as described above, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.
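The meta tag length condition above could be checked along the following lines. The sketch is illustrative only: it inspects meta tags that carry a `name` attribute, and the 200-character default is an assumed stand-in for the unspecified predetermined value. The names `MetaCollector` and `long_meta_contents` are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects (name, content) pairs from named <meta> tags."""

    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").strip().lower()
        if name:
            self.entries.append((name, attrs.get("content") or ""))


def long_meta_contents(html, max_length=200):
    """Return the names of meta tags whose content exceeds max_length.

    max_length is an assumed threshold, not a value from the text.
    """
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return [name for name, content in collector.entries if len(content) > max_length]
```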

(5) Deceptive Site Using String Located at Frame Tag

FIG. 4 e is a diagram showing an example deceptive site using character strings located at a frame tag. In this figure, the left image is a screenshot of a web site displayed to users, and the right image is an HTML source file of the web site displayed on the left side. As well known to those skilled in the art, the frame tag is used to split a screen, on which a web page is displayed, into two or more frames. Referring to the source file on the right side of FIG. 4 e, a frame tag <FRAMESET ROWS=“ ”> is used to split the screen horizontally, where information of the split screen ratio is inserted in “ ”. Character strings located next to the end tag </FRAMESET> of the frame tag include a plurality of popular keywords such as “Starcraft” and “Zolaman”. The character strings, such as the popular keywords, located next to the end frame tag have nothing to do with the splitting of the web page screen. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if a character string is located at a frame tag, and the length of the character string is more than a predetermined numerical value as described above, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.
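Text placed after the closing </FRAMESET> tag can be gathered as sketched below. This is a simplified illustration under stated assumptions: only text following the outermost closing frameset tag is collected, and the length threshold (default 20 characters) is invented for the example. `AfterFramesetText` and `frameset_trailing_text_is_suspicious` are hypothetical names.

```python
from html.parser import HTMLParser


class AfterFramesetText(HTMLParser):
    """Collects text that appears after the outermost </frameset>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of currently open <frameset> tags
        self.closed = False   # True once the outermost frameset has ended
        self.trailing = []

    def handle_starttag(self, tag, attrs):
        if tag == "frameset":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "frameset" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.closed = True

    def handle_data(self, data):
        text = data.strip()
        if text and self.closed and self.depth == 0:
            self.trailing.append(text)


def frameset_trailing_text_is_suspicious(html, max_length=20):
    """True if the text after </frameset> is longer than max_length.

    max_length is an assumed stand-in for the predetermined value.
    """
    finder = AfterFramesetText()
    finder.feed(html)
    finder.close()
    return len(" ".join(finder.trailing)) > max_length
```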

(6) Deceptive Site Using String Contained in Form Tag

FIG. 4 f is a diagram showing an example deceptive site using character strings contained in a form tag. In this figure, the left image is a screenshot of a web site displayed to users, and the right image is an HTML source file of the web site displayed on the left side. As well known to those skilled in the art, the form tag is used to define a desired form in a web page displayed with a web browser. Referring to the source file on the right side of FIG. 4 f, the form tag may be composed as “<form><input type=“button type” value=“displayed text”></form>”. The source file includes a button type “hidden” to set no text to be displayed on a corresponding button. Character strings shown in the source file, which are not displayed on the web page, include a plurality of popular keywords such as “Starcraft” and “Zolaman”. The character strings, such as the popular keywords, contained in the form tag have nothing to do with the definition of a form in the web page. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if the length of a character string included in a form tag is more than a predetermined numerical value as described above, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.
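The hidden form field case above might be checked by collecting the `value` of every input whose type is “hidden” and comparing its length against a threshold. Hidden fields are also used legitimately for short values, so the threshold matters; the default of 50 characters below is an assumption, and `HiddenInputCollector` and `long_hidden_values` are hypothetical names.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects the value attribute of each <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").strip().lower() == "hidden":
            self.values.append(attrs.get("value") or "")


def long_hidden_values(html, max_length=50):
    """Return hidden input values longer than max_length.

    max_length is an assumed stand-in for the predetermined value.
    """
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return [v for v in collector.values if len(v) > max_length]
```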

(7) Deceptive Site Using String Contained in Div Tag

FIG. 4 g is a diagram showing an example deceptive site using character strings contained in a div tag. In this figure, the left image is a screenshot of a web site displayed to users, and the right image is an HTML source file of the web site displayed on the left side. As is well known to those skilled in the art, the div tag is used with a style sheet, using general ID and class attributes. In the source file on the right side of FIG. 4 g, the div tag is described as “<div style="display:none; . . . >”, where an attribute “style” defining a style of character strings to be displayed on a web page is set as “display:none”, so that the character strings following the div tag are not displayed on the web page. The character strings, such as popular keywords, contained in the div tag have nothing to do with display of the web page on the screen. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if the length of a character string included in a div tag is more than a predetermined numerical value as described above, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.
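
The div-based check above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the class name HiddenDivText, the function hidden_div_length and the threshold MAX_HIDDEN_CHARS are hypothetical names and values chosen for illustration, and the sketch assumes that hidden text is marked only by an inline style attribute containing “display:none”.

```python
from html.parser import HTMLParser


class HiddenDivText(HTMLParser):
    """Collects text nested inside <div> elements styled with display:none."""

    def __init__(self):
        super().__init__()
        self.div_stack = []    # one flag per open <div>: True if it is hidden
        self.hidden_text = []  # text fragments found inside hidden divs

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            style = (dict(attrs).get("style") or "").replace(" ", "").lower()
            self.div_stack.append("display:none" in style)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        # Text counts as hidden if any enclosing div is hidden.
        if any(self.div_stack):
            self.hidden_text.append(data)


def hidden_div_length(html: str) -> int:
    """Returns the number of characters of text inside hidden divs."""
    parser = HiddenDivText()
    parser.feed(html)
    parser.close()
    return len("".join(parser.hidden_text).strip())


MAX_HIDDEN_CHARS = 20  # hypothetical "predetermined numerical value"

page = ('<body><div style="display: none">Starcraft Zolaman Starcraft</div>'
        '<div>Welcome</div></body>')
suspicious = hidden_div_length(page) > MAX_HIDDEN_CHARS
```

In this sketch the visible text “Welcome” is ignored, and only the text inside the hidden div is measured against the threshold.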

(8) Deceptive Site Using String Contained in a Href Tag

FIG. 4 h is a diagram showing an example deceptive site using character strings contained in an “a href” tag. In this figure, the left image is a screenshot of a web site displayed to users, and the right image is an HTML source file of the web site displayed on the left side. As is well known to those skilled in the art, the “a href” tag is used to link a specific word or image in a document to a location or address to move to, so as to facilitate movement to a different location in the same document or to a different document or web site. Referring to the source file on the right side of FIG. 4 h, the a href tag may be composed as “<a href="a location or address to move to">a link marking target</a>”. Since no location to move to and no link marking target is assigned in the a href tag shown in FIG. 4 h, the a href tag is not executed and the content therein is not displayed on the web page. Character strings contained in the a href tag, which is not executed, include a plurality of popular keywords such as “Starcraft” and “Zolaman”. The character strings, such as the popular keywords, contained in the a href tag have nothing to do with linking or with display of the web page on the screen. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if the length of a character string included in an “a href” tag is more than a predetermined numerical value as described above, there is a risk that the web site may be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.

(9) Deceptive Site Using Link Farm

FIG. 4 i is a diagram showing an example deceptive site using a link farm. As is well known to those skilled in the art, a link farm is mostly used to increase the search engine ranking of a web page by generating a number of reciprocal links to the web site and thus causing the search engine to continually search for the web site. The link farm may be realized using the a href tags described above.

It is problematic to determine that a web site is a deceptive site merely because it uses a link farm. However, if a web site uses a link farm that includes more than a predetermined number of links in order to cause the search engine to continually search for popular keywords in the web page, the web site needs to be detected because it is highly likely to be a deceptive site.
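
As a rough illustration of the link-count check described above, the following Python sketch counts the a href tags in a source file and compares the count with a limit. The names LinkCounter, count_links and exceeds_link_limit, and the value of MAX_LINKS, are assumptions made for this example only; the description does not fix the predetermined number.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry a non-empty href attribute."""

    def __init__(self):
        super().__init__()
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.link_count += 1


def count_links(html: str) -> int:
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.link_count


MAX_LINKS = 200  # hypothetical "predetermined number" of links


def exceeds_link_limit(html: str, max_links: int = MAX_LINKS) -> bool:
    """True if the page holds more links than the limit allows."""
    return count_links(html) > max_links


sample = '<a href="/p1">one</a><a href="/p2">two</a><a>no target</a>'
sample_links = count_links(sample)
```

A page that exceeds the limit is only a candidate for closer inspection, in line with the caution above that a link farm alone does not establish that a site is deceptive.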

(10) Deceptive Site Using String Contained in Font Tag

FIG. 4 j is a diagram showing an example deceptive site using character strings contained in a font tag. In this figure, the left image is a screenshot of a web site displayed to users, and the right image is an HTML source file of the web site displayed on the left side.

As is well known to those skilled in the art, the font tag is used to set the font size of character strings. In the source file shown in FIG. 4 j, a font size is set to “0” in a font tag, so that character strings contained in the font tag are not displayed on the web page. In the case where the character strings, which are not displayed on a web page due to their font size of “0”, include a plurality of popular keywords such as “Starcraft” and “Zolaman”, the character strings, such as the popular keywords, contained in the font tag have nothing to do with the display of the web page on the screen. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if a font tag in a web site contains character strings whose font size is zero as described above, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.

(11) Deceptive Site Using String Contained in Image Tag

FIG. 4 k is a diagram showing an example deceptive site using character strings contained in an img tag. In this figure, the left image is a screenshot of a web site displayed to users, and the right image is an HTML source file of the web site displayed on the left side.

As is well known to those skilled in the art, the img tag is used to insert a specific image in a document. In the source file shown in FIG. 4 k, “a.gif” is assigned as an image file to be inserted. After assigning the image to be inserted, the img tag generally specifies an attribute such as a location or an alignment method of the image. In the case of FIG. 4 k, such an attribute is specified using character strings. When the image is displayed on the web browser, the attribute specified using the character strings has no influence on the display of the image. In the case where the character strings having no influence on the attribute of the image include a plurality of popular keywords such as “Starcraft” and “Zolaman”, the character strings, such as the popular keywords, contained in the img tag have nothing to do with the display on the web browser screen. However, the search robot provides search results determined based on the frequency of occurrence of a specific character string in a web site, which may cause the subject of the web site to be determined differently from its original subject. Accordingly, if the length of a character string included in an img tag is more than a predetermined numerical value as described above, the web site can be displayed on a search results screen with the help of popular keywords even though it contains content unrelated to the popular keywords.

At step 320 of FIG. 3, a tag or the like contained in an HTML document corresponding to a web site is analyzed, and the length of character strings contained in the tag or the like is measured as described in the above embodiments.

At step 325, it is determined, according to a predetermined basis applied to this measurement result, whether the web site is a deceptive site.

Examples of the predetermined basis at step 325 to determine whether the web site is a deceptive site are as described above with reference to FIGS. 4 a to 4 k. For example, the predetermined basis may be whether or not the HTML document includes a character string of the same color as background color of the web page, or whether or not a redirection tag in the HTML document includes a character string.

According to a preferred embodiment of the present invention, a hybrid of the analyses described above in the deceptive site types (1) to (11) is used as the predetermined basis at step 325, and if the analysis value is more than a predetermined value, it is determined that the web site is a deceptive site. For example, if the number of title character strings contained in a title tag is more than one, 10 points may be added to the analysis value for each string, and up to 70 points may be added thereto. If a redirection page includes character strings, 70 points may be added to the analysis value irrespective of the number of the character strings. For a link farm, 4 points per 50 links, up to 80 points, may be added to the analysis value. If there are character strings whose font size is “0”, 5 points per 100 bytes of the character strings, up to 70 points, may be added to the analysis value. A source file constituting a web page is analyzed in this manner, and if a total analysis value of a web site, calculated using points and weighted values obtained respectively based on the above various bases, is more than 100 points, the web site may be determined to be a deceptive site. If the deceptive site determination is based on only one basis (for example, a web site is determined to be a deceptive site since the number of character strings contained in a title tag of the web site is 50), the determination is highly likely to be erroneous. It is thus preferable that the deceptive site determination be made based on a combination of the various bases.
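
The example point scheme above can be sketched in Python as follows. This is an illustrative reading rather than a definitive implementation: the names PageMeasurements, analysis_points and is_deceptive are hypothetical, the weighted values are assumed to default to 1.0 because the description gives no concrete weights, and partial blocks of 50 links or 100 bytes are assumed not to earn points.

```python
from dataclasses import dataclass


@dataclass
class PageMeasurements:
    """Measurements taken from one source file (hypothetical structure)."""
    title_strings: int = 0           # number of title character strings
    redirect_has_text: bool = False  # redirection page contains character strings
    link_count: int = 0              # number of links in a link farm
    zero_font_bytes: int = 0         # bytes of text whose font size is "0"


DECEPTIVE_THRESHOLD = 100  # a total of more than 100 points marks a deceptive site


def analysis_points(m: PageMeasurements) -> dict:
    """Point value per basis, using the caps given in the example above."""
    return {
        "title": min(10 * m.title_strings, 70) if m.title_strings > 1 else 0,
        "redirect": 70 if m.redirect_has_text else 0,
        "link_farm": min(4 * (m.link_count // 50), 80),
        "zero_font": min(5 * (m.zero_font_bytes // 100), 70),
    }


def is_deceptive(m: PageMeasurements, weights=None) -> bool:
    """Sums weighted points; weights default to 1.0 (an assumption)."""
    weights = weights or {}
    total = sum(weights.get(basis, 1.0) * points
                for basis, points in analysis_points(m).items())
    return total > DECEPTIVE_THRESHOLD


title_only = PageMeasurements(title_strings=50)
title_and_redirect = PageMeasurements(title_strings=50, redirect_has_text=True)
```

Under these assumptions, a site with 50 title strings alone reaches only the 70-point cap and is not flagged, whereas the same site with text on a redirection page reaches 140 points and is flagged, which reflects the preference for combining several bases.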

According to a preferred embodiment of the present invention, different predetermined bases for deceptive site determination may be applied to web sites registered in a robot-based search engine and web sites registered in a directory-based search engine. For example, if source file analysis of a web page corresponding to a web site registered in the robot-based search engine shows that the web site belongs to three of the 11 deceptive site types described above, the web site is determined to be a deceptive site. On the other hand, a web site registered in the directory-based search engine is determined to be a deceptive site even if it belongs to only one of the 11 deceptive site types. This is because most directory-based search engine providers receive registration fees from registrants and therefore need to offer the registrants something in return, whereas robot-based search engine providers register web sites without registration fees.
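
The engine-dependent basis above can be sketched in Python as a simple count of matched deceptive site types. The dictionary REQUIRED_MATCHES, the function name and the type labels are hypothetical, and “belongs to three of the 11 types” is read here as “at least three”, which is an assumption.

```python
# Minimum number of matched deceptive site types per engine kind,
# taken from the example above (three for robot-based, one for directory-based).
REQUIRED_MATCHES = {"robot": 3, "directory": 1}


def is_deceptive_by_type_count(matched_types: set, engine_kind: str) -> bool:
    """True if the site matches enough of the 11 types for its engine kind."""
    return len(matched_types) >= REQUIRED_MATCHES[engine_kind]


# Hypothetical labels for two of the 11 types found on one site.
matched = {"hidden_div_text", "zero_font_text"}
robot_result = is_deceptive_by_type_count(matched, "robot")
directory_result = is_deceptive_by_type_count(matched, "directory")
```

With two matched types, the same site would pass in the robot-based engine but be determined deceptive in the directory-based engine.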

If the web site is determined to be a deceptive site at step 325, the registrant field of the database described above is searched to obtain information of a registrant of the web site (330). Contact information of the registrant is extracted from the registrant information of the web site (335). A warning is given to the registrant of the web site by sending an email or an SMS message to the registrant using the extracted contact information (340). The warning will be described below in detail with reference to FIG. 5.

According to another embodiment of the present invention, an image described in a tag of a web site may be analyzed at step 320. For example, pixels of the image are analyzed to extract RGB components of the pixels, and if the number of pixels of a specific color (for example, yellow) of the extracted RGB components exceeds a predetermined reference value (for example, if such pixels account for 50% or more of the image), the web site may be considered a site containing obscene content, based on which it may be determined whether the web site is a deceptive site.
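
The pixel analysis above can be illustrated with a short Python sketch operating on already decoded RGB tuples. The RGB bounds used in is_yellowish to decide what counts as “yellow” are purely an assumption, as are the function names; the description specifies only a specific color and a reference value such as 50%.

```python
def is_yellowish(rgb) -> bool:
    """Hypothetical test for a 'yellow' pixel: strong red and green, weak blue."""
    r, g, b = rgb
    return r >= 200 and g >= 200 and b <= 100


def color_fraction(pixels, predicate) -> float:
    """Fraction of pixels for which the predicate holds (0.0 for no pixels)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if predicate(p)) / len(pixels)


def exceeds_color_reference(pixels, reference: float = 0.5) -> bool:
    """True if the share of the specific color reaches the reference value."""
    return color_fraction(pixels, is_yellowish) >= reference


# Six yellow pixels and four blue pixels stand in for a decoded image.
image_pixels = [(255, 255, 0)] * 6 + [(0, 0, 255)] * 4
flagged = exceeds_color_reference(image_pixels)
```

Decoding an actual image file into RGB tuples is outside the scope of this sketch and would require an image library.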

FIG. 5 is a flow chart showing a method for imposing a predetermined punitive measure on a registrant of a web site that is determined to be a deceptive site or an altered site, in the method for managing the web sites registered in the search engine, according to a preferred embodiment of the present invention.

With reference to FIG. 5, a description will now be given of how a punitive measure is automatically taken against a web site when it is determined to be a deceptive site at step 325 of FIG. 3. If the web site is determined to be a deceptive site, a web site management module searches a web site information database to obtain information of a registrant of the web site (510), and the web site management module receives the registrant information (520 and 550). According to an embodiment of the present invention, the web site management module extracts contact information of the registrant, such as an email address or a mobile terminal number thereof, from the received registrant information (530), and controls a mail server or an SMS server to transmit a predetermined message to a location corresponding to the contact information (540).

According to another embodiment of the present invention, the web site management module extracts information of other registered web sites of the registrant from the registrant information (560), and then performs a control operation to automatically analyze the other web sites registered under the same registrant name (570). This is because the other web sites registered under the same registrant name are highly likely to be deceptive sites operated based on the same or similar method. In this embodiment, if, based on the analysis of the other registered web sites, it is determined that they are deceptive sites, step 510 of FIG. 5 may be repeated.
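
The flow of FIG. 5 described above can be sketched in Python as follows. In-memory dictionaries stand in for the web site information database, and callables stand in for the analysis step and for the mail or SMS server; all names and the record layout are hypothetical, and the sketch combines the two embodiments (warning and analysis of other sites) into one loop for brevity.

```python
from collections import deque


def process_deceptive_site(url, site_owner, registrants, analyze, send_message):
    """Warns the registrant of a deceptive site and checks their other sites.

    site_owner:   dict mapping a site URL to a registrant id (hypothetical schema)
    registrants:  dict mapping a registrant id to {"contact": ..., "sites": [...]}
    analyze:      callable returning True if a site is judged deceptive
    send_message: callable (contact, text) standing in for a mail or SMS server
    Returns the set of sites for which a warning was sent.
    """
    pending = deque([url])
    warned = set()
    checked = {url}
    while pending:
        site = pending.popleft()
        record = registrants[site_owner[site]]       # steps 510 and 520
        send_message(record["contact"],              # steps 530 and 540
                     f"Warning: {site} was judged deceptive")
        warned.add(site)
        for other in record["sites"]:                # step 560
            if other in checked:
                continue
            checked.add(other)
            if analyze(other):                       # step 570
                pending.append(other)                # repeat from step 510
    return warned


registrants = {"r1": {"contact": "owner@example.com",
                      "sites": ["a.example", "b.example", "c.example"]}}
site_owner = {site: "r1" for site in registrants["r1"]["sites"]}
sent = []
warned = process_deceptive_site(
    "a.example", site_owner, registrants,
    analyze=lambda site: site == "b.example",
    send_message=lambda contact, text: sent.append((contact, text)),
)
```

The set of already checked sites keeps the loop from analyzing the same site twice when several deceptive sites share one registrant.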

According to a preferred embodiment of the present invention, if a web site is determined to be a deceptive site based on the analysis and determination methods, the system for managing the registered web sites may operate to automatically send an email, an SMS message or the like to a registrant of the web site to point out problems of the web site and then request that the registrant of the web site correct the problems within a grace period. In addition, the system may be set to automatically perform the analysis and determination processes after the grace period. If the problems of the web site have not been corrected even after the grace period, a punitive measure, such as cancellation of the registration of the web site, may be taken against the registrant thereof. According to another embodiment of the present invention, a punitive measure such as a complicated registration procedure may be imposed on the registrant of the web site when the registrant requests registration of another web site at a later time.
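
The grace-period handling above can be sketched in Python as a small decision function. The 14-day value of GRACE_PERIOD, the function name next_action and the returned action labels are assumptions for illustration; the description does not state the length of the grace period.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=14)  # hypothetical length of the grace period


def next_action(warned_on: date, today: date, still_deceptive) -> str:
    """Decides what to do with a site whose registrant has been warned.

    still_deceptive is a callable that re-runs the analysis and determination;
    it is invoked only once the grace period has elapsed.
    """
    if today < warned_on + GRACE_PERIOD:
        return "wait"
    if still_deceptive():
        return "cancel_registration"
    return "no_action"


during = next_action(date(2024, 1, 1), date(2024, 1, 10), lambda: True)
after_uncorrected = next_action(date(2024, 1, 1), date(2024, 1, 15), lambda: True)
after_corrected = next_action(date(2024, 1, 1), date(2024, 1, 15), lambda: False)
```

Passing the re-analysis as a callable means the source file is read again only after the grace period, which matches the order of operations described above.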

Embodiments of the present invention further relate to computer readable media that include program instructions for performing various computer-implemented operations. The media may also include, alone or in combination with the program instructions, data files, data structures, tables, and the like. The media and program instructions may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as floptical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The media may also be a transmission medium such as optical or metallic lines, wave guides, etc. including a carrier wave transmitting signals specifying the program instructions, data structures, etc. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.

FIG. 6 is a block diagram showing the internal configuration of a general computer system that can be used in managing web pages registered in the search engine according to the present invention.

The computer system includes any number of processors 640 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 660 (typically a random access memory, or “RAM”) and primary storage 670 (typically a read only memory, or “ROM”). As is well known in the art, primary storage 670 acts to transfer data and instructions uni-directionally to the CPU, and primary storage 660 is used typically to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable type of the computer-readable media described above. A mass storage device 610 is also coupled bi-directionally to CPU 640 and provides additional data storage capacity and may include any of the computer-readable media described above. The mass storage device 610 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk that is slower than primary storage. A specific mass storage device such as a CD-ROM 620 may also pass data uni-directionally to the CPU. Processor 640 is also coupled to an interface 630 that includes one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other well-known input devices such as, of course, other computers. Finally, processor 640 optionally may be coupled to a computer or telecommunications network using a network connection as shown generally at 650. With such a network connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the above-described method steps. The above-described devices and materials will be familiar to those of skill in the computer hardware and software arts.

The hardware elements described above may be configured (usually temporarily) to act as one or more software modules for performing the operations of this invention.

INDUSTRIAL APPLICABILITY

According to the present invention, a method for managing web sites registered in a search engine is provided in which an algorithm is used to automatically detect deceptive sites, thereby allowing users of the search engine to correctly search for their desired information.

According to the present invention, a method for managing web sites registered in a search engine is provided in which deceptive sites are automatically detected, and punitive measures are automatically imposed on operators of the detected deceptive sites, thereby reinforcing self-purification of the web sites registered in the search engine.

According to the present invention, a method for managing web sites registered in a search engine is provided in which an algorithm is used to automatically detect deceptive sites and automatically take punitive measures such as a warning against the detected sites, thereby saving a large amount of human resources that may otherwise have been spent on detecting the deceptive sites.

The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto and their equivalents.

1. A method for managing a web site registered in a search engine, said method comprising the steps of: receiving web site information of the registered web site, classifying the web site information by predetermined fields, and recording the classified web site information in a database; reading a source file constituting a web page of the registered web site; analyzing the read source file; determining, based on a predetermined basis, whether or not the registered web site is a deceptive site; and performing a control operation to perform predetermined processing on the registered web site if the web site is determined to be a deceptive site.
2. The method according to claim 1, wherein the source file is an HTML (Hypertext Markup Language) document.
3. The method according to claim 2, wherein the predetermined basis is whether or not the HTML document includes a character string of the same color as background color of the web page.
4. The method according to claim 2, wherein the predetermined basis is whether or not a redirection tag in the HTML document includes a character string.
5. The method according to claim 2, wherein the predetermined basis is whether or not the length of a title tag included in the HTML document is more than a predetermined numerical value or whether or not the number of title tags included therein is more than one.
6. The method according to claim 2, wherein the predetermined basis is whether or not the length of a character string in a meta tag included in the HTML document is more than a predetermined numerical value.
7. The method according to claim 2, wherein the predetermined basis is whether or not a character string exists in a frame tag in the HTML document.
8. The method according to claim 2, wherein the predetermined basis is whether or not the length of a character string included in a form tag in the HTML document is more than a predetermined numerical value.
9. The method according to claim 2, wherein the predetermined basis is whether or not the length of the same character strings in a div tag in the HTML document is more than a predetermined numerical value.
10. The method according to claim 2, wherein the predetermined basis is whether or not an a href tag in the HTML document includes a character string other than a URL (Universal Resource Locator).
11. The method according to claim 2, wherein the predetermined basis is whether or not the HTML document includes links which link web pages in the same web site, the number of said links being more than a predetermined number.
12. The method according to claim 2, wherein the predetermined basis is whether or not the HTML document includes a character string whose font size is zero.
13. The method according to claim 2, wherein the predetermined basis is whether or not the length of a character string included in an img tag in the HTML document is more than a predetermined numerical value.
14. The method according to claim 2, wherein the predetermined basis includes at least two of the predetermined bases defined in claims 3 to 13.
15. The method according to claim 14, said step of determining whether or not the registered web site is a deceptive site includes the steps of: maintaining predetermined weighted values corresponding respectively to the predetermined bases; calculating respective point values of the predetermined bases according to predetermined point calculation methods corresponding respectively to the predetermined bases; calculating respective intermediate values of the predetermined bases by multiplying the calculated point values respectively by the weighted values corresponding to the predetermined bases corresponding to the calculated point values; obtaining a sum of the calculated intermediate values of the predetermined bases; and determining whether or not the sum of the respective intermediate values is more than a predetermined value, and determining that the web site is a deceptive site if the sum is more than the predetermined value.
16. The method according to claim 2, wherein the predetermined basis is whether or not a combined color value of RGB components of pixels included in an image file contained in an img tag in the HTML document is more than a predetermined value.
17. The method according to claim 1, wherein the database includes a web site registrant field, and wherein said step of performing the control operation to perform the predetermined processing comprises the steps of: obtaining information of a registrant of the web site by searching the web site registrant field of the database; extracting contact information of the registrant from the registrant information of the web site; and transmitting a message to a location corresponding to the extracted contact information.
18. The method according to claim 17, wherein the contact information is a mobile terminal number or an email address of the registrant of the web site, and wherein said step of transmitting the message comprises the step of controlling an email server to send an email to the email address or the step of controlling an SMS server to send an SMS message to the mobile terminal number.
19. The method according to claim 1, wherein the database includes a web site registrant field, and wherein said step of performing the control operation to perform the predetermined processing comprises the steps of: obtaining information of a registrant of the web site by searching the web site registrant field of the database; extracting a URL of a different web site registered by the registrant from the registrant information of the web site; and reading a source file constituting a web page of a web site connected through the URL; analyzing the read source file; determining, based on a predetermined basis, whether or not the web site is a deceptive site; and performing a control operation to perform predetermined processing on the registered web site if the web site is determined to be a deceptive site.
20. A computer-readable recording medium in which a program for performing the method defined in any one of claims 1 to 19 is recorded.
21. A system for managing a web site registered in a search engine, the system comprising: an interface module for performing data communication with at least one terminal; a web site registration module for receiving a web site registration request including web site information of a predetermined web site from said at least one terminal and classifying the web site information by predetermined fields; a database for classifying and storing a predetermined keyword corresponding to the web site and the web site information; a web site analysis module for extracting a source file constituting a web page of the web site, and analyzing the extracted source file; and a web site management module for determining, based on a predetermined basis, whether or not the web site is a deceptive site.